-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reorder tests in maybe_downcast_numeric #55825
Conversation
The comment `# if we have any nulls, then we are done` is not consistent with the test `if isna(arr).any()` because `arr` is constructed only from the first element (`r[0]`) not the full ravel'd list of values. Moreover, calling `np.array()` on some random type can have surprising consequences. So instead, do the early-out test as intended, just using `r[0]` without going through `np.array()`. Then test other things about `r[0]`. Only then should we test all the values (and if we have any nulls, then we are done). See pandas-dev#55824 Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
While there is a conversation going on in #55824, these changes introduce no new functionality nor user-visible changes (other than tolerating Pint's behavior). All these changes do is remove a dependency on the behavior of |
Cool thanks for the PR. Can you add a test and a whatsnew note? |
Would be happy to, but...the test case would require loading Pint. Last time I tried to do some test cases with alien packages I got back a big "No thanks!" from the pandas maintainers. Also, another PR I submitted (enabling many complex test cases) rejected the WhatsNew changes I wrote because there were no user-visible changes. Would you like to guide me further? |
pandas/core/dtypes/cast.py
Outdated
if isna(r[0]): | ||
# do a test on the first element, if it fails then we are done |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test on the first element goes back to dc73315#diff-fb8a9c322624b0777f3ff7e3ef8320d746b15a2a0b80b7cab3dfbe2e12e06daa in core.common
. It appears to me it is no longer necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I constructed a simple test case and I think I understand what the earlier code was trying to do, which was to ravel
only the first element of result
. Consider:
import numpy as np
import pandas as pd
aa = np.array([1.0, np.nan, 2.0])
bb = np.array([aa, aa])
print(aa, pd.isna(aa))
# [ 1. nan 2.] [False True False]
print(bb, pd.isna(bb))
# [[ 1. nan 2.] [ 1. nan 2.]] [[False True False] [False True False]]
print(pd.isna(bb).any())
# True
The fact that pd.isna(bb).any()
sees the nulls inside the sub-arrays means we don't need to ravel the whole thing to look for nulls. The reason to ravel, instead, is to look at the type of the elements, and we only need to ravel the first element to look into that. (We have already tested that there are elements to ravel.)
But...decimal.Decimal
doesn't have a ravel
, which explains perhaps why the original code tried to wrap just the first element back into an array to ravel that.
If the first element of `result` is an array, ravel that to get element we will test. Otherwise use it as is. We only need to check whether `result` is all non-null once. Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
Don't use deprecated array indexing on ExtensionArrays. We need to now us `iloc`. Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
pandas/core/dtypes/cast.py
Outdated
|
||
elif not isinstance(r[0], (np.integer, np.floating, int, float, bool)): | ||
if isinstance(result, np.ndarray): | ||
element = result[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would result.item(0)
work here? might avoid potential copies from ravel
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My suggestion here turns out to cause a different problem bc .item(0) is wrong for dt64 ndarrays
result = np.arange(5).view("M8[ns]")
>>> result.item(0)
0
When processing a multidimensional `ndarray`, we can get the first element by calling `result.item(0)` and completely avoid the copying needed by `ravel` to get the first element that way. We can also eliminates an additional conditional check. Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
So to wrap this up...I don't think it's easy to add a test case because that would involve installing Pint, which might not be very CI/CD-friendly. If there's a good way to do that, please point me to a pattern I can follow. It's easy for me to describe the change in CHANGES, but its not really user-visible. But because there is an issue (though no test case), I could write it up anyway. Thoughts? |
With the current implementation in this PR, I wouldn't expect any behavior changes. You good here? It does change for Pint because there |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
Thanks @MichaelTiemannOSC |
The comment
# if we have any nulls, then we are done
is not consistent with the testif isna(arr).any()
becausearr
is constructed only from the first element (r[0]
) not the full ravel'd list of values. Moreover, callingnp.array()
on some random type can have surprising consequences.So instead, do the early-out test as intended, just using
r[0]
without going throughnp.array()
. Then test other things aboutr[0]
. Only then should we test all the values (and if we have any nulls, then we are done).See #55824
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.